Abstract 

We apply decision tree induction to the prob- 
lem of discourse clue word sense disambigua- 
tion. The automatic partitioning of the train- 
ing set which is intrinsic to decision tree induc- 
tion gives rise to linguistically viable rules. 



Introduction 

Discourse clue words function to convey information 
about the topical flow of a discourse. Clue words can 
be used both to bracket discourse segments and to 
describe the discourse relationship between these seg- 
ments. For example, say can introduce a set of exam- 
ples, as in "Should terrestrial mammals be taken in the 
same breath as types of mammals as say persons and 
apes?"^1 However, each word in Table 1 has at least one 
alternative meaning where the word contributes not to 
discourse level semantics, but to the semantic content 
of individual sentences; this is termed its sentential 
meaning. For example, say can mean "To express in 
words", as in "I don't want to say that he chickened out 
at a presentation or anything but he is in Toronto..." 
Therefore, to take advantage of the information sup- 
plied by discourse clue words, a system must first be 
able to disambiguate between such a word's sentential 
and discourse senses.^2 

^1 The examples come from the corpus used in this study. 

^2 See Hirschberg and Litman [1993] and Schiffrin [1987] 
for details on other clue words and more information about 
clue words in general. In this paper, clue word refers to a 
word from Table 1, regardless of the particular sense with 
which it occurs. 

In this paper, we perform automatic decision tree in- 
duction for this problem of discourse clue word disam- 
biguation using a genetic algorithm. We show several 
advantages to our approach. First, the different decision 
trees that result encode a variety of linguistic general- 
izations about clue words. We show how such linguistic 
rules emerge automatically from the training set par- 
titioning which occurs during decision tree induction. 
These rules can be examined in order to evaluate the 
validity of induced decision trees. Examining the rules 
also provides insights as to the type of syntactic informa- 
tion necessary to further improve clue word sense disam- 
biguation. Second, decision trees are induced which gen- 
eralize across a set of 34 clue words (see Table 1) in con- 
trast to previous automated approaches to word sense 
disambiguation which typically have focused on discrim- 
inating the senses of one word at a time [Schuetze 1992] 
[Brown et al 1991] [Leacock et al 1993] [Black 1988] [Gr- 
ishman and Sterling 1993] [Yarowsky 1993]. As we show, 
this allows for greater learning potential than dealing 
with words individually. There are some problems with 
the domain of disambiguation for clue words and we dis- 
cuss these, indicating why our approach is likely to be 
more helpful for other disambiguation problems. 

The following four sections discuss previous work on 
disambiguation, describe our approach, present experi- 
mental results in both linguistic and numerical terms, 
and draw conclusions and present our future research 
directions. 



Previous Work 

Hirschberg and Litman [1993] explore several methods 
for disambiguating clue words, including measuring the 
ability with which this task can be performed by looking 
only at the punctuation marks immediately before and 
after a clue word, suggesting the strategy embodied by 
the decision tree in Figure 1.^3 This small decision tree 
classifies clue words as discourse exactly when there 
is a period or a comma immediately preceding, and as 
sentential in all other cases. This means, for example, 
that a word is classified as discourse when it is the first 
word of a sentence. For such a simple strategy, the deci- 
sion tree performs to a relatively high degree of accuracy 
over our corpus: 79.16%. Our work investigates disam- 
biguation strategies for clue words which involve looking 
at near-by words in addition to punctuation marks. 
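
As a minimal sketch, the punctuation-only strategy described above can be written as a single test on the token at position -1. The function name `classify_baseline` is hypothetical; the token names `<period>` and `<comma>` follow the notation used for tokens elsewhere in this paper.

```python
def classify_baseline(prev_token: str) -> str:
    """Classify a clue word using only the token immediately preceding it."""
    # Discourse exactly when a period or a comma precedes the clue word;
    # sentential in all other cases.
    if prev_token in {"<period>", "<comma>"}:
        return "discourse"
    return "sentential"


print(classify_baseline("<period>"))  # discourse
print(classify_baseline("to"))        # sentential
```

A clue word that opens a sentence has `<period>` at position -1, which is why sentence-initial clue words are classified as discourse.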

^3 This decision tree is a slightly simplified extrapolation of 
Table 11 from Hirschberg and Litman [1993]. Hirschberg and 
Litman [1993] also investigated the occurrence of clue words 
adjacent to one another, but with no conclusive results. 

Table 1: Discourse clue words and the fraction of times each is used in its discourse sense. 

Figure 1: Manually created decision tree with accuracy 
79.16%. 

The automatic acquisition of disambiguation strate- 
gies has been applied to many types of ambiguity prob- 
lems, including word sense disambiguation [Schuetze 
1992] [Leacock et al 1993] [Yarowsky 1993] [Brown et 
al 1991], determiner prediction [Knight forthcoming], 
and several parsing problems [Resnik 1993] [Mager- 
man 1993]. Previous work using decision tree induc- 
tion for disambiguation includes work by Black [1988] 
(word sense disambiguation), Knight [forthcoming] (de- 
terminer prediction), Resnik [1993] (coordination pars- 
ing) and Magerman [1993] (syntactic parsing). Auto- 
matic approaches to word sense disambiguation have 
thus far primarily focussed on disambiguating one word 
at a time. 

Approach 

In this study, we expand on the orthographic approach 
to clue word disambiguation described by Hirschberg 
and Litman [1993] by allowing decision trees to test not 
only for adjacent punctuation marks and clue words, 
but also for near-by words of any kind, and by allowing 
the decision trees to discriminate between clue words. 
The attributes available to a decision tree are the 
tokens (words and punctuation marks) appearing imme- 
diately to the left of the ambiguous word, immediately 
to the right of the ambiguous word, and 2, 3, and 4 
positions to the right of the ambiguous word, as well as 
the ambiguous word itself (attribute 0), that is, {-1, 0, 
1, 2, 3, 4}. This set of attributes was selected to test 
whether the decision trees would find a wider window 
of tokens useful for clue word disambiguation, but, as 
was automatically determined, only the adjacent tokens 
and the ambiguous word itself were deemed useful. No 
information describing syntactic structure is explicitly 
available to decision trees. The genetic algorithm de- 
termines automatically which words or punctuation in 
these positions are important for disambiguation. 
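
The attribute window can be illustrated with a short sketch. It assumes the text has already been tokenized into words and punctuation tokens; the function name `extract_attributes` and the padding token `<none>` (used when a position falls outside the text) are hypothetical and not part of the original study. Lowercasing reflects the case-insensitive treatment of tokens.

```python
from typing import Dict, List

POSITIONS = (-1, 0, 1, 2, 3, 4)  # attribute positions relative to the clue word
PAD = "<none>"                    # hypothetical filler for out-of-range positions


def extract_attributes(tokens: List[str], clue_index: int) -> Dict[int, str]:
    """Map each attribute position to the token found there."""
    attrs = {}
    for pos in POSITIONS:
        i = clue_index + pos
        attrs[pos] = tokens[i].lower() if 0 <= i < len(tokens) else PAD
    return attrs


tokens = ["to", "say", "that", "he", "is", "in"]
print(extract_attributes(tokens, 1))
# {-1: 'to', 0: 'say', 1: 'that', 2: 'he', 3: 'is', 4: 'in'}
```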

Decision Trees 

Figure 2: Decision tree automatically induced by the ge- 
netic algorithm. This tree disambiguated with 81.10% 
accuracy over the training set, and with 82.30% accu- 
racy over the test set. 

Figures 2 and 3 show example decision trees which were 
automatically induced for clue word sense disambigua- 
tion. Internal nodes are labeled with token positions 
(attributes), arcs are labeled with sets of tokens (val- 
ues), and leaves are labeled with classes, that is, either 
discourse or sentential. Given a text fragment con- 
taining a clue word, a decision tree classifies the word 
as to its sense by a deterministic traversal of the tree, 
starting at the root, down to a leaf. During traversal, 
an arc descending from the current (internal) node is 
selected in order to continue the traversal. This arc is 
chosen by finding the first descending arc, going from 
left to right, containing the token at the text fragment 
position indicated by the current node's label. For ex- 
ample, to traverse the tree in Figure 2, starting at the 
root node, the leftmost arc is traversed if the word at 
position 0 is one of the words on the arc (e.g., say). 
The rightmost arc under each internal node is labeled 
"default", and is traversed when none of its sister arcs 
contain the correct token. 
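
The traversal just described can be sketched as follows. The `Node` class, the use of `None` to mark the default arc, and the example tree `toy_tree` are all hypothetical; the example tree is an illustration only and is not the tree of Figure 2 or Figure 3.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    position: int  # token position tested at this internal node
    arcs: List     # ordered (token_set_or_None, subtree) pairs; None = default arc


def classify(tree, attrs: Dict[int, str]) -> str:
    """Walk from the root to a leaf, taking the first matching arc at each node."""
    node = tree
    while isinstance(node, Node):
        token = attrs[node.position]
        for values, child in node.arcs:  # arcs are tried from left to right
            if values is None or token in values:
                node = child
                break
        else:
            raise ValueError("internal node lacks a default arc")
    return node  # a leaf is simply a class label


punctuation_tree = Node(-1, [({"<period>", "<comma>"}, "discourse"),
                             (None, "sentential")])
toy_tree = Node(0, [({"say"}, Node(-1, [({"to"}, "sentential"),
                                        (None, "discourse")])),
                    (None, punctuation_tree)])

print(classify(toy_tree, {-1: "to", 0: "say"}))        # sentential
print(classify(toy_tree, {-1: "<period>", 0: "well"}))  # discourse
```

Because arcs are tried in order and the default arc comes last, the classification is deterministic even if two token sets under the same node overlap.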

In order to increase the likelihood that an induced de- 
cision tree will embody valid generalizations, as opposed 
to being over-fitted to the particular set of training ex- 
amples, only the tokens which appear with frequency 
above a threshold of 15 in the training cases are permit- 
ted in the value sets of a decision tree (see the subsection 
"The Training Data" for details on the training corpus), 
specifically:^4 

{<period>, <comma>, <apostrophe-s>, a, and, are, as, 
at, can, for, I, in, is, it, of, that, the, this, to, we, you} 

Figure 3: Decision tree automatically induced by the genetic algorithm. This tree disambiguated with 84.99% 
accuracy over the training set, and with 82.30% accuracy over the test set. 

A separate set of tokens is available to the arcs under 
nodes labeled 0, namely the discourse clue words which 
appear with frequency greater than 4 in the training 
cases. (Only clue words appear at position 0.) This 
threshold was chosen to allow infrequent clue words to 
be specified by a decision tree, but to still avert over- 
fitting to the training data. 
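
The two frequency thresholds can be sketched as a simple counting step. This is an illustration under assumptions: the function name `frequent_tokens` is hypothetical, and counting context tokens jointly across all non-zero positions is one possible reading of "frequency in the training cases" rather than a detail stated here. The toy call uses thresholds of 1 so that the small example produces a non-empty result.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def frequent_tokens(cases: List[Dict[int, str]],
                    positions: Iterable[int],
                    threshold: int) -> Set[str]:
    """Return tokens occurring more than `threshold` times at the given positions."""
    positions = tuple(positions)
    counts = Counter(case[pos] for case in cases for pos in positions if pos in case)
    return {token for token, n in counts.items() if n > threshold}


toy_cases = [
    {-1: "to", 0: "say", 1: "that"},
    {-1: "to", 0: "say", 1: "the"},
    {-1: "<period>", 0: "well", 1: "the"},
]
print(sorted(frequent_tokens(toy_cases, (-1, 1), threshold=1)))  # ['the', 'to']
print(sorted(frequent_tokens(toy_cases, (0,), threshold=1)))     # ['say']
```

On the real training set the corresponding calls would use a threshold of 15 for context tokens and 4 for the clue words at position 0.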

Decision Tree Induction 

The corpus used in this study supplies 1,027 examples. 
Table 2 shows sample data. Each training case has a 
manually specified class, and a value corresponding to 
each of 6 attributes. In order to predict the performance 
of an induced decision tree over unseen data, the induc- 
tion procedure is run over a random half of the corpus 
(the training set), and the resulting decision tree is then 
evaluated over the remaining half of the corpus (the test 
set). This division of the data is performed randomly 
before each run.^5 
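
The evaluation protocol can be sketched as below. The helper names `split_half` and `accuracy` are hypothetical, and giving the extra example to the test set when the corpus size is odd is an assumption; the text only states that a random half is used for training.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_half(cases: Sequence, rng: random.Random) -> Tuple[List, List]:
    """Randomly divide the cases into a training half and a test half."""
    shuffled = list(cases)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def accuracy(classifier: Callable, cases: Sequence) -> float:
    """Fraction of (attributes, label) cases that the classifier labels correctly."""
    correct = sum(1 for attrs, label in cases if classifier(attrs) == label)
    return correct / len(cases)


train, test = split_half(range(1027), random.Random(0))
print(len(train), len(test))  # 513 514
```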

^4 Tokens are case-insensitive (capitalization doesn't mat- 
ter), but inflection-sensitive (a is different from an). 

^5 Because of this random division, the frequency distribu- 
tion of tokens in the training set varies, so the valid token 
and clue word sets actually vary slightly. 

Table 2: Example training cases. 

The induction procedure used in this study is a ge- 
netic algorithm (GA) [Holland 1975], a weak learning 
method which has been applied to a wide range of tasks 
in optimization, machine learning, and automatic com- 
puter program induction [Goldberg 1989] [Koza 1992]. 
Inspired by Darwinian survival of the fittest, the GA 
works with a pool (population) of individuals, stochasti- 
cally performing reproductive operators on the individu- 
als, depending on some notion of fitness. Reproductive 
operators include crossover, a stochastic procedure by 
which two individuals are combined to create a third, 
and mutation, by which an individual undergoes a ran- 
dom alteration. In our work, individuals are both de- 
cision trees and the token sets which correspond to de- 
cision tree arcs. Fitness corresponds to the number of 
training cases correctly classified by a decision tree. The 
GA outputs the fittest decision tree it encounters. 
Siegel [1994] describes the details of GA decision tree 
induction applied in this work. Subsection "Numerical 
Results" in this paper contrasts GA decision tree induc- 
tion to classical decision tree induction techniques. 
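
To make the roles of fitness, crossover and mutation concrete, the following is a deliberately simplified sketch and not the procedure of Siegel [1994]. It assumes an individual is only a single token set for position -1 (tokens in the set predict discourse, all others predict sentential), whereas the individuals in this work are full decision trees and their arc token sets. The vocabulary subset, tournament selection, population size, and all function names are illustrative assumptions.

```python
import random

VOCAB = ["<period>", "<comma>", "to", "the", "is"]  # small illustrative subset


def fitness(token_set, cases):
    """Number of training cases the one-node tree classifies correctly."""
    return sum(
        1 for prev, label in cases
        if ("discourse" if prev in token_set else "sentential") == label
    )


def crossover(a, b, rng):
    """Uniform crossover: each token inherits its membership from one parent."""
    return frozenset(t for t in VOCAB if t in (a if rng.random() < 0.5 else b))


def mutate(s, rng):
    """Toggle the membership of one randomly chosen vocabulary token."""
    return frozenset(set(s) ^ {rng.choice(VOCAB)})


def tournament(pop, cases, rng, k=3):
    """Pick the fittest of k randomly sampled individuals."""
    return max(rng.sample(pop, k), key=lambda s: fitness(s, cases))


def run_ga(cases, rng, pop_size=20, generations=30, p_mut=0.2):
    pop = [frozenset(t for t in VOCAB if rng.random() < 0.5) for _ in range(pop_size)]
    best = max(pop, key=lambda s: fitness(s, cases))
    for _ in range(generations):
        nxt = []
        for _ in range(pop_size):
            child = crossover(tournament(pop, cases, rng),
                              tournament(pop, cases, rng), rng)
            if rng.random() < p_mut:
                child = mutate(child, rng)
            nxt.append(child)
        pop = nxt
        gen_best = max(pop, key=lambda s: fitness(s, cases))
        if fitness(gen_best, cases) > fitness(best, cases):
            best = gen_best  # remember the fittest individual seen so far
    return best


toy_cases = ([("<period>", "discourse")] * 5 + [("<comma>", "discourse")] * 3 +
             [("to", "sentential")] * 4 + [("the", "sentential")] * 4 +
             [("is", "sentential")] * 2)
best = run_ga(toy_cases, random.Random(0))
print(sorted(best), fitness(best, toy_cases))
```

On this toy data the token set {<period>, <comma>} classifies all 18 cases correctly, while the empty set classifies 10; the GA is stochastic, so the individual it returns depends on the random seed.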

The Training Data 

The 1,027 training examples come from a corpus used 
by Hirschberg and Litman [1993]. This is a transcript 
of a single-speaker speech, preceded by introductory 
remarks by other speakers, in which each occurrence of 
the words in Table 1 has been manually marked as to its 
meaning by a linguist.^6 When marking the corpus, the 
linguist had access to the entire transcript plus a record- 
ing of the speech. Therefore, much more information 
was available to the linguist than there is to a decision 
tree. Regardless, about 7% were deemed ambiguous by 
the linguist. The "ambiguous" examples were left out 
of this study since they provide no information on how 
to disambiguate. 407 of the 1,027 unambiguous lexi- 
cal items (39.63%) were marked as discourse, and 620 
(60.37%) were marked as sentential. See Hirschberg 
and Litman [1993] for more detail on the corpus and 
the distribution of data within it. 

Results 

Since the division between training and test cases is 
random for each run, and since the GA is a stochas- 
tic method, each run of the GA gives rise to a unique 
decision tree.^7 We performed 58 runs, thus generating 
58 trees. We evaluate these trees in two ways. First, by 
manually examining several high scoring trees, we show 
they yield linguistically valid rules. Second, we measure 
the average performance of induced decision trees. 

Linguistic Results 

The small decision tree in Figure 1, a tree obtained man- 
ually by Hirschberg and Litman [1993], yields an accu- 
racy of 79.16%. To attain any improvement in accuracy, 
a more complex partitioning of the training cases must 
take place, by which the GA focuses on the cases where 
the majority of error lies. It is by this partitioning pro- 
cess that additional linguistic rules are induced. 

A decision tree implicitly partitions the training (and 
test) cases; each rule embedded in a decision tree corre- 
sponds to a partition. As an example, the small decision 
tree in Figure 1 corresponds to the following simple par- 
titioning of the training data: 

-1 = <period> is true for 189 cases (185 discourse). 
-1 = <comma> is true for 72 cases (42 discourse). 
766 cases remain (180 discourse). 

In order to attain a higher accuracy than that of the 
small decision tree, the partition consisting of the 766 
"remaining" cases, for example, is a viable candidate 
for re-partitioning - rules must be found which apply to 
subpartitions of that partition. As we show here, many 
of the induced rules tend to be linguistically viable. 
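
The partition counts listed above can be checked against the reported 79.16% accuracy of the small tree. The sketch below uses only the counts from that partitioning; the variable names are hypothetical. The tree predicts discourse for the first two partitions and sentential for the remaining cases.

```python
partitions = [
    # (rule, cases, discourse cases, class predicted by the small tree)
    ("-1 = <period>", 189, 185, "discourse"),
    ("-1 = <comma>", 72, 42, "discourse"),
    ("remaining", 766, 180, "sentential"),
]

total = sum(n for _, n, _, _ in partitions)
correct = sum(d if pred == "discourse" else n - d for _, n, d, pred in partitions)
errors = [(rule, n - d if pred == "discourse" else d) for rule, n, d, pred in partitions]

print(total, correct, f"{100 * correct / total:.2f}%")  # 1027 813 79.16%
print(errors)  # [('-1 = <period>', 4), ('-1 = <comma>', 30), ('remaining', 180)]
```

Of the 214 misclassified cases, 180 fall in the "remaining" partition, which is why that partition is the natural target for re-partitioning.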

^6 We used one linguist's markings, whereas Hirschberg and 
Litman [1993] used and correlated the judgements of that 
and another linguist, discarding those cases in which there 
was disagreement. Thus, the data we used was slightly 
noisier than that used by Hirschberg and Litman [1993]. Fur- 
ther, we used a slightly larger portion of the marked tran- 
script than is reported on by Hirschberg and Litman [1993]. 

^7 Technically, there is a very small possibility that the 
same decision tree will be induced by two different runs of 
the GA. 

There are two ways to examine the resulting rules. 
First, we identify general rules that apply to sets of clue 
words (i.e., more than one) from several trees. In par- 
ticular, we note that different trees yield different gen- 
eralizations. Second, we identify all generalizations en- 
coded in high scoring trees for individual clue words. 
These generalizations identify the rules that, in combi- 
nation, can be used for a single clue word. In analyzing 
these generalizations, we note where they are specific to 
the corpus and where we expect them to generalize to 
different domains. 

Multiple Clue Word Rules. Table 3 displays exam- 
ple linguistic rules extracted from various decision trees, 
and lists the clue words to which they apply. Each rule 
consists of a comparison (under column "If") and the 
clue word sense which results if the comparison holds 
(under column "Then" - "S" stands for sentential and 
"D" stands for discourse). The "Linguistic Template" 
column indicates the most frequent part of speech of the 
clue word when the comparison holds, as determined 
manually, and is elaborated below. "Accuracy" shows 
the number of cases in the corpus for which the rule 
holds, divided by the number of cases in the corpus 
which match the pattern. 
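
The "Accuracy" figure for a rule can be computed as in the following sketch. The function name `rule_accuracy` and the four toy cases are hypothetical and are not drawn from the corpus; the condition mirrors the form of a rule that tests position -1 for a fixed set of clue words.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[Dict[int, str], str]  # (attributes by position, manually marked class)


def rule_accuracy(cases: List[Case],
                  condition: Callable[[Dict[int, str]], bool],
                  predicted: str) -> Tuple[int, int]:
    """Return (cases where the rule is correct, cases matching the condition)."""
    matching = [label for attrs, label in cases if condition(attrs)]
    correct = sum(1 for label in matching if label == predicted)
    return correct, len(matching)


toy_cases = [
    ({-1: "to", 0: "say"}, "sentential"),
    ({-1: "to", 0: "see"}, "sentential"),
    ({-1: "to", 0: "look"}, "discourse"),
    ({-1: "i", 0: "say"}, "discourse"),
]


def preceded_by_to(attrs: Dict[int, str]) -> bool:
    return attrs[-1] == "to" and attrs[0] in {"see", "look", "further", "say"}


print(rule_accuracy(toy_cases, preceded_by_to, "sentential"))  # (2, 3)
```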

These rules strongly suggest strategies by which part 
of speech is used for disambiguation; the rules embody 
the fact that a clue word's sense is sentential if its 
part of speech is not a conjunction, and must be further 
disambiguated if it is a conjunction. 

The first rule classifies an occurrence of either see, 
look, further or say as Sentential if position -1 is to. 
(These are the clue words for which this rule holds in 
the corpus.) Of the 30 times for which this condition 
holds, the rule is correct 29; the rule holds exactly when 
the listed words are behaving as verbs, as indicated by 
the linguistic template "to <verb>", e.g.: 

...we can foster this integration of AI techniques and 
database technology to further the goal of integrating 
the two fields into Expert Database Systems. 

This example is in fact the only occurrence of further 
in the corpus for which to is the immediately preceding 
token. However, the GA can induce this rule since it 
is generally applicable over the 4 clue words (as shown 
in the tree of Figure 3). This demonstrates the ben- 
efit gained by simultaneously disambiguating multiple 
words. 

The second rule listed (100% accuracy) embodies two 
different "syntactic templates". Both are detected by 
checking for -1 = the. The first, which occurs for like, 
and and right, determines that the sense is sentential 
if the clue word is being used as a noun^8, as in: 

...a lot of work going on now in what's called non- 
monotonic reasoning, circumscription and the like... 

and the second, which occurs for right, first and next, 
determines that the sense is sentential if the clue word 
is being used as an adjective in a noun phrase, as in: 



...I think this is the first time those three are cooperat- 
ing... 

^8 And is a noun when it is used to refer to the logical 
operator. 

The third rule (90.11% accuracy) pinpoints the collo- 
cation "as well". When in this collocation, well is being 
used as an adverb, e.g.: 

We could have just as well done without it but the sys- 
tem would run a lot more slowly. 

The fourth rule (76.92% accuracy), which applies to 
the 8 clue words listed, approximates the cases where a 
clue word is being used as an adverb, as in: 

And then in the summer of 1985 Ron left the West Coast 
to travel east to New Jersey where he is now at AT&T 
Bell Laboratories as head of the AI Principles Research 
Department. 

However, the condition "-1 = is" holds for some cases 
in which a clue word is used in its discourse sense, as 
in: 

...and the second question is well where do we stop. 

Therefore, this particular rule is too simplistic for some 
cases. However, it has indicated for us a disambiguation 
method which uses the part of speech of the clue word. 

Single Clue Word Rules. Table 4 shows the way 
the decision trees in Figures 2 and 3 disambiguate and 
and say, respectively. The decision trees are explicitly 
broken down into the rules used to disambiguate the 
individual clue words. The columns in the table are 
the same as in the previous table, with the addition of 
"Decision tree", which points to the tree being analyzed. 
The rules for each word are listed in the order in which 
they are considered when traversing the decision tree. 
Therefore, for example, the condition of the fourth rule 
for and is only tried on cases for which -1 is none of 
<period>, <comma>, or is, and this is reflected in the 
number of cases for which the condition holds, as listed 
in the "Accuracy" column. This number of occurrences 
is a count across the entire corpus; that is, both the 
training and test cases. The overall accuracy with which 
the example decision trees disambiguate the individual 
clue words is also shown. 

The rules for and reflect the fact that, when coordi- 
nating noun phrases, and is usually being used in its 
sentential sense, and, when coordinating clauses, and 
is most often being used in its discourse sense. 

The first two rules for and are the same as the first two 
rules of the small decision tree in Figure 1. The third 
and eighth rules hold for too few examples to draw any 
conclusions. The condition of the fourth rule approxi- 
mates the cases for which and is being used to coordinate 
noun phrases, since most definite noun phrases are not 
the subject of a clause in the corpus. For example: 

...I've been very lucky to be able to work with Don Mar- 
shand and the institute in organizing this... 

This is clearly too simple a strategy (64.29% accuracy), 
but provides insight for improved strategies. 

The fifth, sixth and seventh rules (75.00%, 85.71% 
and 83.33% accuracy) approximate the cases for which 
and is coordinating clauses, since I, we and this are most 
frequently the subject of a clause in the corpus, as in: 

The idea of the tutorial sessions was precisely to try to 
bring people up to speed in areas that they might not be 
familiar with and I hope the tutorials accomplish that 
for you. 

The small tree of Figure 1, which disambiguates in 
general with accuracy 79.16%, only disambiguates and 
with accuracy 71.84% (The small tree disambiguates the 
occurrences of clue words other than and with accuracy 
82.92%). However, the overall accuracy with which the 
decision tree in Figure 2 disambiguates and is 76.44%. 

The decision tree in Figure 3 treats say differently 
and separately from the other clue words: After the 
first default arc is traversed, say is always disambiguated 
as discourse, while other words are treated differently 
(e.g., well is further tested). This demonstrates the util- 
ity of allowing decision trees to discriminate between 
clue words, since say occurs with sense discourse more 
frequently than most other clue words; say is only dis- 
ambiguated with accuracy 41.67% by the small tree of 
Figure 1, but is disambiguated with accuracy 83.33% by 
the induced decision tree of Figure 3. 

The cases for which the tree in Figure 3 classifies say 
as sentential are when say behaves as a verb, as in: 

That is if I say that John is both a Quaker and a Re- 
publican... 

As demonstrated by the contents of Table 4, most in- 
stances of clue words in the corpus are disambiguated by 
rules which hold with high accuracy, as measured across 
the entire corpus, while the decision trees were induced 
over only half of the corpus (the training set). This 
indicates that performance will remain high for unseen 
examples from similar corpora. 

Numerical Results 

From 58 runs of the GA, each with a random division 
between training and test cases, the maximum score over 
the test cases was 83.85%.^9 The performance of such 
a tree over unseen data ideally requires further formal 
evaluation with more test data. 

The average score over the test cases for the 58 runs 
was 79.20%. The average disparity between training and 
test scores, 2.64, is not large. Therefore, the rules of 
induced decision trees tend to perform well over unseen 
data, although it is inconclusive whether their combined 
contribution to disambiguation accuracy improves over 
the overall performance of the small tree in Figure 1 
(79.16%) for the entire set of clue words. However, de- 
cision trees clearly aid in the disambiguation of several 
of the clue words, e.g. say and and. 

^9 This is the maximum test score of the decision trees 
which performed the best of their run over the training 
cases. This same pool of trees is considered for average test 
performance. 

Table 3: Linguistic rules extracted from various automatically induced decision trees. 

Table 4: The rules used by sample trees to disambiguate and and say. 

These results reflect the difficulty inherent to the task 
of clue word sense disambiguation. Hirschberg and Lit- 
man [1993] report that 7.87% of the examples manually 
marked by the authors were either disagreed upon by 
the authors, or were decidedly ambiguous. 

Many disambiguation tasks will presumably not have 
a simple strategy (such as that embodied by the small 
decision tree in Figure 1) which performs to such a high 
degree of accuracy. For example, the aspectual clas- 
sification of a clause requires the interaction of sev- 
eral syntactic constituents of the clause [Pustejovsky 
1991]. Therefore, since the disparity between train- 
ing and test performance is moderate, decision tree in- 
duction is likely, in general, to outperform such simple 
strategies for disambiguation tasks. 

As a benchmark, several top-down (recursive parti- 
tioning) decision tree induction methods [Quinlan 1986] 
[Breiman et al 1984] were applied to the disambigua- 
tion corpus. This comparison was motivated by the 
fact that top-down decision tree induction is the more 
established method for decision tree induction.^10 The best 
top-down method disambiguated the test cases with ac- 
curacy 79.06% on average (based on 200 runs, each with 
a random division between training and test sets), which 
is comparable to the GA's average performance, 79.20%. 

^10 These experiments were performed using the IND deci- 
sion tree induction package [Buntine and Caruana 1991]. 

GAs are a weak learning method which often requires 
less explicit engineering of heuristics than top-down in- 
duction. For an investigation of the generalization per- 
formance of GA decision tree induction see Siegel [1994]. 
Tackett [1993] and Greene & Smith [1987] have also per- 
formed comparisons between GA techniques and recur- 
sive partitioning methods. 

Conclusions and Future Work 

The disambiguation of and and say, as well as other clue 
words, has benefited from the integration of knowledge 
about surrounding words, without the explicit encoding 
of syntactic data. Further, we have demonstrated that 
the automatic partitioning of the training set during de- 
cision tree induction provides an array of linguistically 
viable rules. These rules provide insights as to syntactic 
information which would be additionally beneficial for 
clue word sense disambiguation. Further, the rules can 
help linguists evaluate the validity of induced decision 
trees. 

We have demonstrated the utility of disambiguating a 
set of words simultaneously: generalizations which apply 
over several words are induced, and, when training over 
a small corpus, this allows generalizations to be made on 
examples that occur extremely infrequently (e.g., once). 

We plan to apply machine learning methods to aspec- 
tual ambiguity. The aspectual class of a clause depends 
on a complex interaction between the verb, its particles, 
and its arguments [Pustejovsky 1991]. Induction will 
be performed simultaneously over a set of verbs, with 
access to the syntactic parse of example clauses. 